Confirming data consistency in a data storage environment

ABSTRACT

A method for confirming replicated data at a data site, including utilizing a hash function, computing a first hash value based on first data at a first data site and utilizing the same hash function, computing a second hash value based on second data at a second data site, wherein the first data had previously been replicated from the first data site to the second data site as the second data. The method also includes comparing the first and second hash values to determine whether the second data is a valid replication of the first data. In additional embodiments, the first data may be modified based on seed data prior to computing the first hash value and the second data may be modified based on the same seed data prior to computing the second hash value. The process can be repeated to increase reliability of the results.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 14/628,851, titled “Confirming Data Consistency in a Data Storage Environment,” filed Feb. 23, 2015, which is a continuation of U.S. patent application Ser. No. 13/680,265, also titled “Confirming Data Consistency in a Data Storage Environment,” filed Nov. 19, 2012, the disclosures of which are hereby incorporated by reference herein in their entirety.

FIELD OF THE INVENTION

The present disclosure generally relates to systems and methods for confirming data consistency in a data storage environment. Particularly, the present disclosure relates to systems and methods for efficiently confirming data consistency between two or more network connected data storage sites in a data storage subsystem or information handling system, without, for example, consuming too significant an amount of network bandwidth between the data sites, and which may be particularly useful in systems with relatively slower network links or connections.

BACKGROUND OF THE INVENTION

As the value and use of information continues to increase, individuals and businesses seek additional ways to process and store information. One option available to users is information handling systems. An information handling system generally processes, compiles, stores, and/or communicates information or data for business, personal, or other purposes thereby allowing users to take advantage of the value of the information. Because technology and information handling needs and requirements vary between different users or applications, information handling systems may also vary regarding what information is handled, how the information is handled, how much information is processed, stored, or communicated, and how quickly and efficiently the information may be processed, stored, or communicated. The variations in information handling systems allow for information handling systems to be general or configured for a specific user or specific use such as financial transaction processing, airline reservations, enterprise data storage, or global communications. In addition, information handling systems may include a variety of hardware and software components that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.

Likewise, individuals and businesses seek additional ways to protect or secure information, so as to improve, for example, reliability, fault tolerance, or accessibility of that information. One such method of protecting information in an information handling system involves replicating or sharing information so as to ensure consistency between redundant resources, such as data storage devices, including but not limited to, disk drives, solid state drives, tape drives, etc. Replication of information, or data, is possible across various components communicatively coupled through a computer network, so the data storage devices may be, and often desirably are, located in physically distant locations. One purpose of data replication, particularly remote data replication, is to prevent damage from failures or disasters that may occur in one location, and/or in case such events do occur, improve the ability to recover the data.

During conventional replication processes, when data gets transmitted from an originating or source site to a destination site, the data is typically, and should be, confirmed as received. Furthermore, during transmission, such data is also often confirmed as accurate, so as to verify the data had been transmitted and received successfully. In this regard, various checks may be used to confirm successful transmission of the data, including for example, cyclic redundancy checks (CRCs), which are specifically designed to protect against common types of errors during communication, and can provide quick and reasonable assurance of the integrity of any messages transmitted.

After transmission, however, the data at one of the locations may become invalid (e.g., incorrect) over time for any number of explainable or unexplainable reasons. One manner by which it can be confirmed that data from a first site is consistent with data at a second site is to resend the data from one of the sites to the other and either verify that the data matches that stored at the other site or simply rewrite the data at the other site. However, this can put a significant strain on the available network bandwidth between the two sites, which may otherwise be used for other communication between the sites, such as initial replication. The demand on bandwidth may be further compounded where there are numerous data storage sites scattered over various remote locations, each attempting to confirm validity of its data.

Accordingly, what is needed are better systems and methods for confirming data consistency in a data storage environment that overcome the disadvantages of conventional methods for data confirmation. Particularly, what is needed are systems and methods for efficiently confirming data consistency between two or more network connected data storage sites in a data storage subsystem or information handling system, without, for example, consuming too significant an amount of network bandwidth between the data sites, which may otherwise desirably be reserved for other communication between the systems. Such systems and methods could be particularly useful with, but are not limited to use in, systems with relatively slower network links or connections.

BRIEF SUMMARY OF THE INVENTION

The present disclosure, in one embodiment, relates to a method for confirming the validity of replicated data at a data storage site. The method includes utilizing a hash function, computing a first hash value based on first data at a first data storage site and utilizing the same hash function, computing a second hash value based on second data at a second data storage site, wherein the first data had previously been replicated from the first data storage site to the second data storage site as the second data. The method also includes comparing the first and second hash values to determine whether the second data is a valid replication of the first data. Typically, the first and second data storage sites are remotely connected by a network. As such, the method may include transmitting either the first or second hash values via the network for comparing with the other hash value. Likewise, the hash function may also be transmitted via the network from either the first or second data storage site to the other storage site, which can help ensure the same hash function is used by each. In other embodiments, a data structure, such as a table, may be provided for storing a plurality of hash functions, each being available for use by the first and second data storage sites. In this regard, the method may further include selecting the hash function for use from the table for utilization in computing the first and second hash values. In additional or alternative embodiments, the first data may be modified based on seed data prior to computing the first hash value and the second data may be modified based on the same seed data prior to computing the second hash value. The seed data may also be transmitted via the network from either the first or second data storage site to the other, which can help ensure the same seed data is used by each. In some embodiments, the process can be repeated any suitable number of times, each time utilizing a different hash function than in a previous time. Doing so can increase the reliability of the confirmation results. The process can be repeated according to any manner; however, in one embodiment, the process is repeated according to a predetermined periodic cycle.

The present disclosure, in another embodiment, relates to an information handling system. The system includes a first data storage site configured to compute a first hash value based on first data stored at the first data storage site, utilizing a hash function. Likewise, the system includes a second data storage site, having data replicated from the first data storage site, and similarly configured to compute a second hash value based on second data stored at the second data storage site, utilizing the same hash function. Either or both of the first data storage site and second data storage site may be configured to transmit its computed hash value via a computer network to the other site for comparison of the first hash value with the second hash value so as to determine whether the second data is a valid replication of the first data. Typically, the first data storage site and the second data storage site are remote from one another. A mismatch during the comparison of the first and second hash values generally indicates that either the first or second data storage site or both includes invalid data. In additional embodiments, the first data storage site may be configured to modify the first data based on seed data prior to computing the first hash value, and the second data storage site may be configured to modify the second data based on the same seed data prior to computing the second hash value.

The present disclosure, in yet another embodiment, relates to a method for confirming the validity of replicated data at a data storage site. The method includes utilizing a hash function, computing a first hash value based on a selected portion of first data at a first data storage site and utilizing the same hash function, computing a second hash value based on a selected portion of second data at a second data storage site, wherein the first data had previously been replicated from the first data storage site to the second data storage site as the second data. The selected portion of replicated second data corresponds to the selected portion of first data. Upon computation of the hash values, the first and second hash values may be compared so as to determine whether the selected portion of second data is a valid replication of the selected portion of first data. These steps may be repeated any suitable number of times, each time utilizing a different selected portion of the first data and corresponding selected portion of the second data than in a previous time. The results of such a repeated process are substantially representative of whether the entire second data is a valid replication of the entire first data. For example, each subsequent repetition in a contiguous chain of repetitions resulting in a match of the first and second hash values increases the likelihood that the second data is indeed a valid replication of the first data. The process can be repeated according to any manner; however, in one embodiment, the process is repeated according to a predetermined periodic cycle.

While multiple embodiments are disclosed, still other embodiments of the present disclosure will become apparent to those skilled in the art from the following detailed description, which shows and describes illustrative embodiments of the invention. As will be realized, the various embodiments of the present disclosure are capable of modifications in various obvious aspects, all without departing from the spirit and scope of the present disclosure. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS

While the specification concludes with claims particularly pointing out and distinctly claiming the subject matter that is regarded as forming the various embodiments of the present disclosure, it is believed that the invention will be better understood from the following description taken in conjunction with the accompanying Figures, in which:

FIG. 1 is a schematic of a disk drive system suitable for use with the various embodiments of the present disclosure.

FIG. 2 is a concept drawing of a method for confirming data consistency in a data storage environment in accordance with one embodiment of the present disclosure.

FIG. 3 is a concept drawing of a method for confirming data consistency in a data storage environment illustrating a collision.

FIG. 4 is a flow diagram of a method for confirming data consistency in a data storage environment in accordance with one embodiment of the present disclosure.

FIG. 5A is a concept drawing of a method for confirming data consistency in a data storage environment in accordance with another embodiment of the present disclosure.

FIG. 5B is a concept drawing of a data confirmation process performed subsequent to the method for confirming data consistency in a data storage environment of FIG. 5A.

DETAILED DESCRIPTION

The present disclosure relates to novel and advantageous systems and methods for confirming data consistency in a data storage environment. Particularly, the present disclosure relates to novel and advantageous systems and methods for efficiently confirming data consistency between two or more network connected data storage sites in a data storage subsystem or information handling system, without, for example, consuming too significant an amount of network bandwidth between the data sites, and which may be particularly useful in systems with relatively slower network links or connections or in systems where there is a significant amount of other communication which should have priority to the available bandwidth.

For purposes of this disclosure, an information handling system may include any instrumentality or aggregate of instrumentalities operable to compute, calculate, determine, classify, process, transmit, receive, retrieve, originate, switch, store, display, communicate, manifest, detect, record, reproduce, handle, or utilize any form of information, intelligence, or data for business, scientific, control, or other purposes. For example, an information handling system may be a personal computer (e.g., desktop or laptop), tablet computer, mobile device (e.g., personal digital assistant (PDA) or smart phone), server (e.g., blade server or rack server), a network storage device, or any other suitable device and may vary in size, shape, performance, functionality, and price. The information handling system may include random access memory (RAM), one or more processing resources such as a central processing unit (CPU) or hardware or software control logic, ROM, and/or other types of nonvolatile memory. Additional components of the information handling system may include one or more disk drives, one or more network ports for communicating with external devices as well as various input and output (I/O) devices, such as a keyboard, a mouse, touchscreen and/or a video display. The information handling system may also include one or more buses operable to transmit communications between the various hardware components.

While the various embodiments are not limited to any particular type of information handling system, the systems and methods of the present disclosure may be particularly useful in the context of a disk drive system, or virtual disk drive system, such as that described in U.S. Pat. No. 7,613,945, titled “Virtual Disk Drive System and Method,” issued Nov. 3, 2009, the entirety of which is hereby incorporated herein by reference. Such disk drive systems allow the efficient storage of data by dynamically allocating user data across a page pool of storage, or a matrix of disk storage blocks, and a plurality of disk drives based on, for example, RAID-to-disk mapping. In general, dynamic allocation presents a virtual disk device or volume to user servers. To the server, the volume acts the same as conventional storage, such as a disk drive, yet provides a storage abstraction of multiple storage devices, such as RAID devices, to create a dynamically sizeable storage device. Data progression may be utilized in such disk drive systems to move data gradually to storage space of appropriate overall cost for the data, depending on, for example but not limited to, the data type or access patterns for the data. In general, data progression may determine the cost of storage in the disk drive system considering, for example, the monetary cost of the physical storage devices, the efficiency of the physical storage devices, and/or the RAID level of logical storage devices. Based on these determinations, data progression may move data accordingly such that data is stored on the most appropriate cost storage available. In addition, such disk drive systems may protect data from, for example, system failures or virus attacks by automatically generating and storing snapshots or point-in-time copies of the system or matrix of disk storage blocks at, for example, predetermined time intervals, user configured dynamic time stamps, such as, every few minutes or hours, etc., or at times directed by the server. These time-stamped snapshots permit the recovery of data from a previous point in time prior to the system failure, thereby restoring the system as it existed at that time. These snapshots or point-in-time copies may also be used by the system or system users for other purposes, such as but not limited to, testing, while the main storage can remain operational. Generally, using snapshot capabilities, a user may view the state of a storage system as it existed in a prior point in time.

FIG. 1 illustrates one embodiment of a disk drive or data storage system 100 in an information handling system environment 102, such as that disclosed in U.S. Pat. No. 7,613,945, and suitable for use with the various embodiments of the present disclosure. As shown in FIG. 1, the disk drive system 100 may include a data storage subsystem 104, which may include, but is not limited to, a RAID subsystem, as will be appreciated by those skilled in the art, and a disk manager 106 having at least one disk storage system controller. The data storage subsystem 104 and disk manager 106 can dynamically allocate data across disk space of a plurality of disk drives or other suitable storage devices 108, such as but not limited to optical drives, solid state drives, tape drives, etc., based on, for example, RAID-to-disk mapping or other storage mapping technique. The data storage subsystem 104 may include data storage devices distributed across one or more data sites at one or more physical locations, which may be network connected. Any of the data sites may include original and/or replicated data (e.g., data replicated from any of the other data sites) and data may be exchanged between the data sites as desired.

As described above, individuals and businesses seek ways to protect or secure information, so as to improve, for example, reliability, fault tolerance, or accessibility of that information. One such method of protecting information in an information handling system involves replicating or sharing information so as to ensure consistency between redundant resources, such as data storage devices, including but not limited to, disk drives, solid state drives, optical drives, tape drives, etc. During conventional replication processes, when data gets transmitted from an originating site to a destination site, the data is typically, and should be, confirmed as successfully received, often involving a check, such as CRC, which is specifically designed to protect against common types of errors during communication. As discussed above, however, over periods of time following transmission, the data at one of the locations may become invalid. While one method of confirming that data from a first site is consistent with data at a second site is to resend the data from one site to the other, such method can put a significant strain on the available network bandwidth between the two sites. This demand on bandwidth may be further compounded where there are numerous data storage sites scattered over various remote locations, each attempting to confirm validity of its data.

The present disclosure improves data confirmation/validation processes for data stored in a data storage system or other information handling system, such as but not limited to the type of data storage system described in U.S. Pat. No. 7,613,945. The disclosed improvements can provide more cost effective and/or more efficient data confirmation/validation processes, particularly in systems with relatively slower network links or connections or in systems where there is a significant amount of other communication which should have priority to the available bandwidth.

In general, in one embodiment of the present disclosure, at any given point in time subsequent to initial transmission of the replicated data, in order to confirm data consistency between two data sites, such as but not limited to, an originating or source site and a replication site, the to-be-confirmed data at each site may be hashed using the same hash function or algorithm and the hash values may then be compared to determine whether they are equal. If they are equal, that is, the to-be-confirmed data at each site hashed to the same hash value, it is likely that the underlying data, to which the hash values correspond, is also the same and therefore valid at both sites. Such method for confirming data consistency between two data sites can be performed across a network using as little bandwidth as that required to send the computed hash values from one site to the other, thereby significantly reducing the amount of bandwidth used as compared to that conventionally used for confirming data consistency between two sites by resending all the data.

More specifically, in one embodiment illustrated in FIG. 2, an initial step may include hashing data 202, or a set of data 204, at a first one of the data sites 206, such as but not limited to, an originating or source site, utilizing a specified or predetermined hash function 208 or algorithm so as to obtain a hash value 210 or set of hash values 212 for the data or set of data, respectively. Data, as used herein, in addition to its ordinary meaning, is meant to include any logical data unit or portion of a logical data unit, including but not limited to data blocks, data pages, volumes or virtual volumes, disk extents, disk drives, disk sectors, or any other organized data unit, or portions thereof. A data set, as used herein, is meant to include a plurality of data. Utilizing the same hash function 208, replication data 214 at a second one of the data sites 216, such as but not limited to, a replication site, corresponding to the data 202 or a set of data 204 at the first data site may similarly be hashed so as to obtain a hash value 218 or set of hash values 220 for the replication data. The hash value(s) 210, 212 from the first data site 206 may be transmitted 222, such as via a network, including but not limited to, a LAN, WAN, or the Internet, to the second data site 216 at any time, such as but not limited to, any time after initial replication, for comparison with the hash value(s) 218, 220 at the second data site, or vice versa, so that the hash values can be compared. If the hash value(s) from each site are equal, it is likely that the underlying data 202, 204 at data site 206 and the underlying data 214 at data site 216 are also the same and therefore valid at both sites. The hash values generally consume significantly less space than the underlying data, and therefore, when sent across a network, would consume significantly less bandwidth. This contrasts with the conventional methods of confirming data consistency between sites by resending all the data from one site to the other, as discussed above, which can put a significant strain on the available network bandwidth between the two sites.
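For illustration only, the comparison described above can be sketched in a few lines of Python. This is a minimal sketch and not the disclosed implementation: the function names `hash_value` and `is_valid_replication`, the choice of SHA-256, and the in-memory byte strings standing in for data at the two sites are all assumptions made for the example.

```python
import hashlib


def hash_value(data: bytes, algorithm: str = "sha256") -> bytes:
    """Compute a hash value for one logical data unit (hypothetical helper)."""
    return hashlib.new(algorithm, data).digest()


def is_valid_replication(first_hash: bytes, second_hash: bytes) -> bool:
    """Equal hash values suggest, but do not guarantee, equal underlying data."""
    return first_hash == second_hash


# Stand-ins for the data held at the first (source) and second (replication) sites.
first_data = b"block contents " * 4096
second_data = bytes(first_data)      # an intact replica
corrupted = bytearray(first_data)
corrupted[100] ^= 0x01               # a replica with a single flipped bit

h1 = hash_value(first_data)          # computed at the first site
h2 = hash_value(second_data)         # computed at the second site
h3 = hash_value(bytes(corrupted))

print(is_valid_replication(h1, h2))
print(is_valid_replication(h1, h3))
print(len(first_data), "bytes of data versus", len(h1), "bytes of hash value")
```

Only the hash value would need to cross the network in such a scheme, which is why the bandwidth used is a small fraction of that needed to resend the data itself.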

In other embodiments, the hash value(s) from one site need not necessarily be sent to the other, but instead, the hash value(s) from each of the sites could be sent to a third site for comparison. However, such embodiments would require two separate transmissions of the hash value(s) as opposed to a single transmission of the hash value(s) from one site to the other. Nonetheless, in some embodiments, it may be beneficial for the comparison to be performed at a third site.

In one embodiment, in order to ensure the same hash function is utilized by both sites, the hash function, or an identification of the hash function, used may be transmitted from one of the sites to the other. In particular embodiments, the hash function, or an identification of the hash function, used may be transmitted from the first data site 206, for example, prior to computation of the hash values or at generally the same time as, and along with, the computed hash value(s) 210, 212. In such embodiments, any number of hash functions may be provided and available for selected use at any given time; all that may be necessary is for both sites to have knowledge of, and use, the same hash function.
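As a hedged illustration of sending an identification of the hash function along with the computed hash values, the sketch below packs both into a single message. The JSON layout, the field names `algorithm` and `hashes`, and the helper names are assumptions made for this example only; the disclosure does not prescribe any message format.

```python
import hashlib
import json


def build_confirmation_message(data_units, algorithm: str) -> str:
    """First site: identify the hash function and attach one hash per data unit."""
    hashes = [hashlib.new(algorithm, unit).hexdigest() for unit in data_units]
    return json.dumps({"algorithm": algorithm, "hashes": hashes})


def check_against_message(message: str, local_units) -> list:
    """Second site: hash local data with the identified function and compare."""
    request = json.loads(message)
    algorithm = request["algorithm"]
    local = [hashlib.new(algorithm, unit).hexdigest() for unit in local_units]
    return [sent == mine for sent, mine in zip(request["hashes"], local)]


source_units = [b"page-0 contents", b"page-1 contents"]
replica_units = [b"page-0 contents", b"page-1 c0ntents"]  # second unit differs

message = build_confirmation_message(source_units, "sha512")
results = check_against_message(message, replica_units)
print(results)
```

Because the receiving site reads the function name from the message, both sites necessarily hash with the same function for that confirmation pass.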

In other embodiments, a single hash function may be provided by the system and be stored at, or be accessible to, each data site. Thus, every data site will always have access to the same hash function as used by the others, and needs no further information from the other sites or elsewhere in order to identify the hash function required. In such embodiments, no transmission of the hash function between data sites would be necessary. Of course, the single hash function could be replaced periodically or on some other replacement basis, to help ensure data reliability and security. In still other example embodiments, there may be multiple hash functions provided and stored in, for example, a hash table that is directly or indirectly accessible to each of the data sites at any given time. Although a table is described, it is recognized that the term table is conceptual and any data structure is suitable. At any given time, which hash function is selected from the table and used may be based on any algorithm. For example, and without limitation, the hash function used at any given time may be based on the actual or system time at which data confirmation between two sites is performed, and may alternate between available hash functions as time passes. While specific examples have been provided herein, the present disclosure is not limited by such examples, and any method, now known or later developed, for ensuring that a particular hash function is utilized by each data site during data confirmation may be used and is within the scope of the present disclosure.
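One way the time-based selection from a table of hash functions might look is sketched below. The table contents, the one-hour rotation interval, and the names `HASH_TABLE`, `ROTATION_SECONDS`, and `select_hash_name` are hypothetical; the sketch also assumes both sites evaluate the selection with the same timestamp (for example, one agreed for the confirmation pass), since sites reading their own clocks near a rotation boundary could otherwise pick different functions.

```python
import hashlib

# Hypothetical table of hash functions available to every data site.
HASH_TABLE = ("sha256", "sha512", "blake2b", "sha3_256")
ROTATION_SECONDS = 3600  # assumed rotation interval: one hour


def select_hash_name(epoch_seconds: int) -> str:
    """Pick a table entry from the time at which confirmation is performed."""
    slot = (epoch_seconds // ROTATION_SECONDS) % len(HASH_TABLE)
    return HASH_TABLE[slot]


def hash_with_selected(data: bytes, epoch_seconds: int) -> bytes:
    """Hash data with whichever function the table yields for that time."""
    return hashlib.new(select_hash_name(epoch_seconds), data).digest()


agreed_time = 7200  # timestamp both sites use for this confirmation pass
print(select_hash_name(agreed_time))
print(hash_with_selected(b"replicated page", agreed_time).hex()[:16])
```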

A hash function is generally any algorithm or subroutine that maps data, or a data set, often of large and variable lengths, to smaller data sets of a fixed length. For example only, and certainly not limited to, names of varying lengths for a plurality of people could each be hashed to a single integer. The values returned by a hash function are often referred to as hash values, hash codes, hash sums, checksums, or simply hashes. A hash function is deterministic, meaning that for a given input value, it consistently generates the same hash value.

Despite the deterministic nature of a hash function with respect to a given input, however, a hash function may map several different inputs to the same hash value, causing what is typically referred to as a “collision.” For example only, again using names as inputs, both “John Smith” and “Bob Smyth” may hash to the same hash value, depending on the hash algorithm used. Unfortunately, due to the very nature of hash functions, which map relatively larger data sets to relatively smaller data sets, even under the best of circumstances, collisions are mathematically unavoidable.
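The unavoidability of collisions can be made concrete with a deliberately weak hash. The sketch below uses a toy function, `tiny_hash`, that has only 256 possible outputs; it is an assumption made purely for demonstration and is not a hash function contemplated for actual use. Feeding it 257 distinct inputs must produce at least one collision by the pigeonhole principle.

```python
def tiny_hash(data: bytes) -> int:
    """Toy hash with only 256 possible values: the byte sum modulo 256."""
    return sum(data) % 256


def find_collision(inputs):
    """Return the first pair of distinct inputs sharing a hash value, if any."""
    seen = {}
    for item in inputs:
        value = tiny_hash(item)
        if value in seen and seen[value] != item:
            return seen[value], item
        seen.setdefault(value, item)
    return None


# 257 distinct inputs into 256 possible outputs: a collision is guaranteed.
candidates = [f"name-{i}".encode() for i in range(257)]
pair = find_collision(candidates)
print(pair, tiny_hash(pair[0]), tiny_hash(pair[1]))

# Reordering bytes never changes a byte sum, so these two inputs also collide.
print(tiny_hash(b"ab") == tiny_hash(b"ba"))
```

Practical hash functions have vastly larger output spaces, which makes collisions rare rather than impossible.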

Accordingly, in the various embodiments described herein, while it is likely that the underlying data at two sites are equal when the hash values computed for the data at each site are equal, it is not a sure-fire guarantee that the underlying data at the data sites is indeed the same. For a simple example, as illustrated in FIG. 3, valid Data 1 302 at data site 304 was previously transmitted for replication at data site 306. However, at some point after being sent by data site 304 for replication at data site 306, either during transmission or sometime after, the corresponding replicated data 308 had been rendered invalid without the knowledge of either data site. During a subsequent data confirmation process to confirm validity of the data, in accordance with the various embodiments of the present disclosure previously discussed, which utilize and transmit hash values rather than full data sets, Data 1 302 at data site 304 may hash to a hash value of “0” 310. While typically uncommon with small data sets, although increasingly less uncommon with relatively larger amounts of data or larger data sets, at replication data site 306, the corresponding replicated data 308, which has been rendered invalid, may similarly produce the same hash value of “0” 312 under the same hash function, despite the fact that the replicated data 308 and the original Data 1 302 are not indeed equal. When this type of collision occurs, the system may not recognize that the data at one of the sites is invalid, and would not likely take steps to rebuild the data. Thus, this type of collision may result in what is referred to herein as an “undetected error” in the corresponding data between the data sites, in that the error would continue to go undetected despite continuing efforts under such data confirmation processes. Notwithstanding the potential for undetected errors, for some applications, such as but not limited to, those applications with small amounts of data or where the data being replicated is substantially or relatively unimportant, when the hash values from each data site are equal, it can be very likely that the underlying data at the data sites is also the same. In most cases, depending at least partly on the type of logical data unit hashed and the hash function used, the probability of getting a collision and an undetected error may be so small that the reduction in bandwidth usage outweighs the risk of getting such a collision.

However, in applications with, or in data storage systems having, relatively large amounts of data or relatively large data sets, the value of results obtained simply by hashing the data at two sites and comparing the hashed values may decrease significantly. The value of the above-discussed methods of data confirmation may decrease because, as the data size increases, the amount of data hashing to the same hash value increases, thereby increasing collisions. As collisions increase, the risk of undetected errors can also increase.

In view of the foregoing, the above-discussed methods for confirming data consistency may include additional steps or techniques to increase the reliability of the data confirmation results and reduce the probability that a collision resulting in an undetected error will occur. While various embodiments of the present disclosure will be described in more detail below, generally, additional embodiments may utilize hash seeds and/or hash functions with cryptographic properties, thereby exhibiting good avalanche effects, or in some cases, relatively high avalanche effects. In addition, the same to-be-confirmed data may be repeatedly hashed and verified over time, with each repetition including a change in one or more of: the hash function used, the hash seed used, or the starting and/or endpoint of the data hashed within the to-be-confirmed data. With each subsequent pass of the data confirmation process, including at least one change, a positive comparison of the hash values increases the confidence that the underlying data is, in fact, identical. For example, a positive match of hash values in a first pass will give some indication that the underlying data is likely the same. A positive match of hash values in a second pass, where the second pass includes at least one change as described above, will increase the confidence that the underlying data is the same. A positive match of hash values in a third or more pass, where each subsequent pass includes at least one change as described above, will further increase the confidence that the underlying data is the same. In general, the more passes completed and resulting in a positive match of hash values, the more likely the underlying data is, in fact, the same.
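The repeated-pass idea can be sketched as follows, with each pass changing at least one of the hash function, the seed, or the starting point of the hashed range. The schedule `PASSES`, the helper names, and the specific functions, seeds, and offsets are assumptions chosen for illustration. For brevity both data copies are hashed in one process here; in practice each site would hash its own copy and only the hash values would be exchanged.

```python
import hashlib


def pass_digest(data: bytes, algorithm: str, seed: bytes, start: int, end) -> bytes:
    """Hash the selected range of the data, with the seed prepended."""
    return hashlib.new(algorithm, seed + data[start:end]).digest()


# Hypothetical schedule: every pass differs from the previous one in at least
# one of hash function, seed, or start/end point (None means "to the end").
PASSES = [
    ("sha256", b"", 0, None),
    ("sha512", b"", 0, None),       # change of hash function
    ("sha512", b"\x01", 0, None),   # change of seed
    ("sha512", b"\x01", 512, None), # change of starting point
]


def count_matching_passes(first: bytes, second: bytes) -> int:
    """Count consecutive passes whose hash values match; stop at a mismatch."""
    matches = 0
    for algorithm, seed, start, end in PASSES:
        if pass_digest(first, algorithm, seed, start, end) != pass_digest(
            second, algorithm, seed, start, end
        ):
            break
        matches += 1
    return matches


original = bytes(range(256)) * 8
damaged = bytearray(original)
damaged[1000] ^= 0xFF

print(count_matching_passes(original, bytes(original)))
print(count_matching_passes(original, bytes(damaged)))
```

Each additional matching pass adds confidence because an undetected error would have to collide under every function, seed, and range in the schedule.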

More specifically, in one additional embodiment, to increase the reliability of the data confirmation results and reduce the probability of a collision resulting in an undetected error, the original data may be altered in a predetermined manner by adding a hash “seed” to the data at each site prior to, and for the purposes of, hashing the corresponding data at each site into respective hash values. The hash seed may be prepended to the data, appended to the data, or may be otherwise inserted at any suitable location within the data, and generally, may be combined or associated with the original data, or set of data, so as to alter the original data in a predetermined manner. More particularly, a hash seed may be any predetermined data of any size, such as but not limited to, an integer, a word, or a bit, byte or any other sized chunk of organized data having a predetermined value. However, typically the hash seed value would be very small, so that the hash seed conforms with the effort to reduce bandwidth usage between the data sites for data confirmation processes. In alternative embodiments, a hash seed could be a modifying algorithm or function that takes the original data as input and outputs modified data, which is modified according to a predefined deterministic algorithm.
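By way of non-limiting illustration, the seeding described above might be sketched as follows. The sketch assumes SHA-256 as the hash function and a one-byte seed; the function name `seeded_digest`, the `position` parameter, and the sample data are hypothetical and are not taken from the present disclosure.

```python
import hashlib


def seeded_digest(data: bytes, seed: bytes, position: str = "prepend") -> str:
    """Combine a hash seed with the data in a predetermined manner, then hash.

    SHA-256 is an assumption made for this sketch; any hash function that
    both data sites agree on could be substituted.
    """
    if position == "prepend":
        combined = seed + data
    elif position == "append":
        combined = data + seed
    else:
        raise ValueError("position must be 'prepend' or 'append'")
    return hashlib.sha256(combined).hexdigest()


# Hypothetical data held at the first site and its replica at the second site.
source_data = b"replicated block contents"
replica_data = b"replicated block contents"
seed = b"\x2a"  # hypothetical one-byte seed known to both sites

first_hash = seeded_digest(source_data, seed)
second_hash = seeded_digest(replica_data, seed)
match = first_hash == second_hash  # only the small hash values are compared
```

In such a sketch, only the seed (or an identification of it) and the resulting hash values would need to travel between the sites, which is consistent with keeping the seed small.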

As indicated above, additional embodiments may utilize hash functions with cryptographic properties, which typically exhibit good avalanche effects, or in some cases, relatively high avalanche effects. When relatively high avalanche effects are present in a hash function, whenever the input is changed, even ever so slightly (for example, by changing a single bit), the output changes significantly or even drastically. In this regard, providing a hash seed to alter the data in a predetermined manner can cause significant changes in the computed hash values, thereby reducing the risk that a collision will occur where the underlying data is not, in fact, equivalent at both data sites.
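The avalanche effect described above can be observed with a short experiment. The following is a minimal sketch, assuming SHA-256 as an example of a hash function with cryptographic properties; the helper `bit_difference` and the sample data are hypothetical.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


original = bytearray(b"data stored at the first site")
altered = bytearray(original)
altered[0] ^= 0x01  # change a single bit of the input

digest_original = hashlib.sha256(bytes(original)).digest()
digest_altered = hashlib.sha256(bytes(altered)).digest()

# For a hash function with a good avalanche effect, roughly half of the
# 256 output bits would be expected to differ.
changed_bits = bit_difference(digest_original, digest_altered)
```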

A process for confirming the data between two data sites utilizing a hash seed is generally carried out in much the same manner as described above with respect to FIG. 1, with the additional step(s) of adding, combining, or otherwise associating a hash seed with the data, or sets of data, at each data site in order to alter the data in a predetermined manner prior to hashing the data to their respective hash values. The hash seed may be randomly generated, or may be generated according to any suitable algorithm. Similar to the manners described above for ensuring that the same hash function is used at each data site, in one embodiment, in order to also ensure the same hash seed is utilized by each data site, the hash seed used, or an identification of the hash seed, may be transmitted from one data site to the other. In further embodiments, the hash seed or an identification of the hash seed may be transmitted from a first data site to a second data site, for example, prior to computation of the hash values or at generally the same time as, and along with, the hash function used by, and the computed hash value(s) obtained at, the first data site. In such embodiments, any number of hash seeds may be provided and available for selected use at any time; all that may be necessary is for both sites to have knowledge of, and use, the same hash seed. Indeed, as discussed above, a hash seed could be randomly generated, as long as the same randomly generated hash seed is utilized at each data site during the data confirmation process. Also like the hash function, in other embodiments, a single hash seed may be provided by the system and be stored at, or accessible to, each data site. Thus, every data site will always have access to the same hash seed used by the others, and needs no further information from the other sites or elsewhere in order to identify the hash seed required. In such embodiments, no transmission of the hash seed between data sites would be necessary. Of course, the single hash seed could be replaced periodically or on some other replacement basis, to help ensure data reliability and security. In still other example embodiments, multiple hash seeds may be provided in, for example, a hash seed table that is directly or indirectly accessible to each of the data sites at any given time. At any given time, which hash seed is selected from the table and used may be based on any algorithm. For example, and without limitation, the hash seed used at any given time may be based on the actual or system time at which data confirmation between two sites is performed, and may alternate between available hash seeds as time passes. While specific examples have been provided herein, the present disclosure is not limited by such examples, and any method, now known or later developed, for ensuring that a particular hash seed is utilized by each data site during data confirmation may be used and is within the scope of the present disclosure.
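As one hedged example of selecting a hash seed from a hash seed table based on time, the following sketch rotates through a table at a fixed interval. The table contents, the one-hour rotation period, and the name `select_seed` are assumptions made purely for illustration; the disclosure does not prescribe any particular selection algorithm.

```python
# Hypothetical hash seed table accessible to every data site.
SEED_TABLE = [b"seed-a", b"seed-b", b"seed-c", b"seed-d"]
ROTATION_SECONDS = 3600  # assumed rotation period of one hour


def select_seed(timestamp: float,
                table=SEED_TABLE,
                period: int = ROTATION_SECONDS) -> bytes:
    """Pick a seed from the table based on the time of the confirmation run.

    Both sites must evaluate this with the same timestamp (for example, the
    time at which the confirmation was initiated) to arrive at the same seed.
    """
    index = int(timestamp // period) % len(table)
    return table[index]
```

Under this sketch, the sites would alternate between the available seeds as time passes without transmitting the seed itself, provided both sites use the same reference time for a given confirmation run.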

Altering the original data in a predetermined manner prior to hashing by utilizing a hash seed, as described above, can reduce the likelihood of a collision resulting in an undetected error. Additionally utilizing a hash function with good avalanche effect can further reduce the likelihood of a collision resulting in an undetected error. Specifically, if a comparison of the hash values between the data sites, now based on seeded data and a hash function with good avalanche effect, indicates that the hash values from each site are equal, then there is generally a very strong likelihood that the underlying data at each data site is also the same and thus valid at both sites.

With reference again to FIG. 3, for example, presume that both Data 1 302 at data site 304 and the purportedly corresponding invalid data 308 at replication data site 306 were each prepended, or otherwise combined, with a hash seed, the hash seed being the same for each. Depending partly on the complexity of the hash seed selected and the avalanche effect of the hash function utilized, it is, in general, very unlikely that the Data 1 302 and the invalid data 308 would still hash to the same hash value. Thus, utilizing a hash seed can further increase the reliability of the various data confirmation processes described herein. Moreover, despite the additional transmission of a hash seed, transmitting the hash seed, hash function, and computed hash values between the data sites still generally consumes significantly less network bandwidth than retransmitting all of the underlying data, as done conventionally.

As indicated above, the same to-be-confirmed data may additionally be repeatedly hashed and verified over time, with each repetition including a change in one or more characteristics of the hashing process, such as but not limited to: the hash function used, the hash seed used, or the starting and/or ending point of the data hashed within the to-be-confirmed data. Each subsequent pass of such data confirmation processes resulting in a positive comparison of the hash values increases the confidence that the underlying data is, in fact, identical. The more passes completed and resulting in a positive match of hash values, the more likely the underlying data is, in fact, the same.

More specifically, in some additional embodiments of the present disclosure, the data confirmation process may be repeated for any data or set of data with a different hash function. Likewise, in embodiments where a hash seed is utilized, the data confirmation process for any data or set of data may be repeated with a different hash function and/or a different hash seed. As will be appreciated, repeating the data confirmation process utilizing a different hash function and/or a different hash seed will typically result in different hash values for the same original data as compared to those generated in a previously performed confirmation process. Because it is very unlikely that unequal data (e.g., original Data 1 302 and replicated data 308 in FIG. 3) being compared between two sites would hash to equivalent hash values in both a first data confirmation process, utilizing a first hash function and optionally a hash seed, and a second data confirmation process, utilizing a different hash function and/or hash seed, it is very unlikely that a collision would go undetected. Accordingly, where repeated data confirmation processes result in equal hash values between the data sites, the confidence that the underlying data at those data sites is also the same is increased significantly. In this way, a repeated confirmation process can act as a sort of double-check to confirm the results of any previous confirmation process.
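A repeated confirmation of this kind might be sketched as shown below. The sketch runs both "sites" in a single process for simplicity, and assumes three hash functions from the Python standard library together with hypothetical seeds; the names `PASSES`, `digest`, and `confirm` are illustrative only.

```python
import hashlib

# Hypothetical schedule of passes: each pass changes the hash function and seed.
PASSES = [
    ("sha256", b"seed-1"),
    ("blake2b", b"seed-2"),
    ("sha3_256", b"seed-3"),
]


def digest(data: bytes, algorithm: str, seed: bytes) -> str:
    """Hash the seed followed by the data using the named algorithm."""
    h = hashlib.new(algorithm)
    h.update(seed)
    h.update(data)
    return h.hexdigest()


def confirm(source: bytes, replica: bytes, passes=PASSES):
    """Run each pass in turn.

    Returns (True, number_of_passes) if every pass matched, or
    (False, index_of_first_mismatching_pass) otherwise.
    """
    for index, (algorithm, seed) in enumerate(passes):
        if digest(source, algorithm, seed) != digest(replica, algorithm, seed):
            return False, index
    return True, len(passes)
```

In a deployment along the lines described above, each site would compute its own digest and only the digests would be exchanged; the passes could also be spread out over time rather than run back to back.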

With reference to FIG. 4, an example method of repeated data confirmation is illustrated in a flow diagram. While illustrated with respect to actions performed at an originating or source site, it is recognized that similar steps may be performed at the destination site. Additionally, while the example method of FIG. 4 is discussed with respect to certain steps, it is recognized that not every embodiment will include each step illustrated in FIG. 4, that some embodiments may include additional steps, and that in other embodiments, the steps may be performed in another order. In step 402 of the example method of FIG. 4, a hash function may be selected or provided. As discussed above, in some cases, there may only be a single hash function available, while in other cases, one of several available hash functions may be selected. In addition, in step 402, if a hash seed is optionally used, a hash seed may be selected, provided, or otherwise generated. In step 404, the hash function and any optional hash seed may be transmitted to the destination site, so that it is ensured that the destination site has available the same hash function and hash seed. Of course, as discussed above, any other suitable method of ensuring that both sites utilize the same hash function and/or hash seed may be utilized, and transmitting the hash function and/or hash seed from the source site to the destination site is but one example. In step 406, if a hash seed is optionally being utilized, the data or data set may be added, combined, or otherwise associated with the hash seed to alter the data or data set in a predetermined manner, as discussed above. In step 408, a first block of the data or data set, as optionally modified by a hash seed, may be hashed using the selected or provided hash function. If the data or data set comprises more than one block, as illustrated in FIG. 4, each block is hashed in a similar manner to that of the first block. Either as the data is hashed, or generally immediately or shortly thereafter, in step 410, the hash values or set of hash values may be transmitted to the destination site for comparison to the hash values computed thereat, as described above. Some period of time later, whether based on a periodic schedule, random schedule, administrator's instruction, or other method, the process may be repeated on the same data or data set, initiating again at step 402 with a hash function being selected or provided and an optional hash seed being selected, provided, or otherwise generated. Typically, either or both of the hash function and hash seed are changed to increase the confidence of the data confirmation process with each pass.
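The block-by-block hashing of steps 406 through 410 might be sketched as follows. This is a minimal illustration only: the fixed block size, the choice to combine the seed with each block individually, and the names `block_hashes` and `mismatched_blocks` are assumptions rather than features required by FIG. 4.

```python
import hashlib
from typing import List


def block_hashes(data: bytes, block_size: int,
                 algorithm: str = "sha256", seed: bytes = b"") -> List[str]:
    """Hash the data one block at a time (compare steps 406 and 408).

    Assumption for this sketch: the optional seed is prepended to every block.
    """
    hashes = []
    for offset in range(0, len(data), block_size):
        h = hashlib.new(algorithm)
        h.update(seed)
        h.update(data[offset:offset + block_size])
        hashes.append(h.hexdigest())
    return hashes


def mismatched_blocks(source_hashes: List[str],
                      replica_hashes: List[str]) -> List[int]:
    """Return indexes of blocks whose hash values differ or are missing."""
    length = max(len(source_hashes), len(replica_hashes))
    return [
        i for i in range(length)
        if i >= len(source_hashes)
        or i >= len(replica_hashes)
        or source_hashes[i] != replica_hashes[i]
    ]


# Hypothetical data: the replica differs from the source in its second block.
source = b"A" * 10 + b"B" * 10 + b"C" * 5
replica = b"A" * 10 + b"X" * 10 + b"C" * 5

bad = mismatched_blocks(block_hashes(source, 10, seed=b"s"),
                        block_hashes(replica, 10, seed=b"s"))
```

Hashing per block in this way would let a mismatch be localized to a particular block, at the cost of transmitting one hash value per block (step 410) rather than a single value.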

The process can be performed or repeated in this manner, with varying hash functions and/or, if a hash seed is used, varying hash seeds, any suitable or desired number of times. In some embodiments, the more times the process is repeated with varying hash functions and/or hash seeds, the greater the confidence that the underlying data is valid at each data site. Indeed, the data confirmation process may be repeated as many times as may be desired so as to obtain a specified or required confidence level, which may vary depending partly on the type of data stored and the significance or importance thereof. Because transmitting the hash function, computed hash values, and optionally a hash seed between the data sites generally consumes significantly less network bandwidth than conventional methods, the data confirmation process could be repeated several times over a given period of time without significantly compromising the available bandwidth.
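The relationship between the number of passes and a required confidence level can be illustrated with a simplified model. The model below is an assumption introduced for illustration and is not stated in the present disclosure: it supposes that each pass independently fails to detect a difference with the same probability $p$, and that $\varepsilon$ denotes the largest acceptable probability of an undetected error after $k$ passes.

```latex
% Simplified, idealized model (assumes independent passes with equal miss probability p)
\Pr[\text{difference undetected after } k \text{ passes}] = p^{k}
\qquad\Longrightarrow\qquad
p^{k} \le \varepsilon
\iff
k \ge \frac{\ln \varepsilon}{\ln p}, \quad 0 < p < 1 .

% Example: an idealized 32-bit hash, p = 2^{-32}, and a target of \varepsilon = 2^{-96}
k \ge \frac{\ln 2^{-96}}{\ln 2^{-32}} = \frac{96}{32} = 3 .
```

Under these idealized assumptions, three matching passes would suffice in the example; in practice the passes may not be fully independent, so such a figure would serve only as a rough guide.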

FIGS. 5A and 5B help illustrate another embodiment for confirming data consistency as an alternative to or in addition to utilizing a hash seed as described above. As illustrated in FIG. 5A, for any given data 502 having a plurality of addressable units 504 at a first data site, instead of hashing all the data, a subset 506 (e.g., Addresses 1-5 in FIG. 5A) of addressable units may be selected for hashing in accordance with the various embodiments described above, and the resulting hash value may be compared with a hash value computed for replicated data at a second data site corresponding to the addressable units in the selected subset 506. In some embodiments, the result of the comparison of the hash values computed for the data corresponding to subset 506 may be representative of the validity of all the data 502.

In a subsequent or repeated data confirmation process, as illustrated in FIG. 5B, a different subset 508 (e.g., Addresses 3-7) of addressable units from the data 502 may be hashed and compared with a hash value computed for replicated data at the second data site corresponding to the addressable units in the newly selected subset 508. Again, the result of the comparison of the hash values computed for the data corresponding to subset 508 may be representative of the validity of all the data 502. However, because the addressable units selected for the subsets 506 and 508 differ, if both the first and subsequent confirmation processes result in matching hash values, then the confidence that all the data 502 is valid at both sites increases. Any suitable number of additional data confirming processes may be run in a similar manner, with each process selecting a different subset of data from the previous process.

Performing a plurality of confirmation processes utilizing various subsets 506, 508 of data 502, as described above, increases the reliability of the data confirmation results for all the data 502 and reduces the probability of a collision resulting in an undetected error. The reliability is increased because if there is any invalid portion of data 502 at one of the sites, it is very likely that at least one of the selected subsets would result in a mismatch of hash values, thereby indicating an error in the data at one of the sites. In some embodiments, the more times the process is repeated with varying subsets of data, the greater the confidence that the underlying data is valid at each data site.
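The subset approach of FIGS. 5A and 5B might be sketched as follows, using the example address ranges 1-5 and 3-7. The unit contents, the corrupted address, SHA-256, and the name `subset_digest` are hypothetical choices made for this sketch.

```python
import hashlib


def subset_digest(units, start: int, end: int) -> str:
    """Hash addressable units start..end inclusive (1-based, as in FIG. 5A)."""
    h = hashlib.sha256()  # SHA-256 is an assumption for this sketch
    for unit in units[start - 1:end]:
        h.update(unit)
    return h.hexdigest()


# Hypothetical data with eight fixed-size addressable units.
source_units = [bytes([i]) * 4 for i in range(1, 9)]
replica_units = list(source_units)
replica_units[5] = b"\xff" * 4  # the replica's unit at Address 6 is invalid

# First pass: Addresses 1-5 do not cover the invalid unit, so the hashes match.
pass_1_match = (subset_digest(source_units, 1, 5)
                == subset_digest(replica_units, 1, 5))

# Second pass: Addresses 3-7 cover Address 6, so the hashes differ.
pass_2_match = (subset_digest(source_units, 3, 7)
                == subset_digest(replica_units, 3, 7))
```

Here the first pass alone would not reveal the invalid unit, while the second pass, which selects a different subset, does; this reflects why varying the subset between passes increases the likelihood of catching an invalid portion of the data.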

Although illustrated as subsets comprised of contiguous addressable units, subsets 506, 508 need not be comprised of contiguous addressable units, but could be any organized subset of data. Similarly, although shown with eight addressable units in FIGS. 5A and 5B, subsets 506, 508 could include fewer or greater addressable units. Likewise, the number of addressable units included in a subset and hashed during any given data confirmation process need not be the same as any previous confirmation process. That is, for any given data confirmation process, a subset could include any suitable number of addressable units without regard to how many addressable units are used in any previous or subsequent data confirmation process. Additionally, any particular addressing method for addressing the addressable units may be utilized. Furthermore, in some embodiments, the subsets need not be defined by traditional addressable units, but rather could be defined by any method, such as but not limited to, simply defining a starting and ending point within the data 502.

Any of the various embodiments for data confirmation of the present disclosure may be run at any time, and any of the various embodiments for data confirmation of the present disclosure may be triggered by any suitable method, such as but not limited to, triggered manually by an administrator, triggered automatically by the data storage subsystem or a controller or other processing device located at one of the data sites, triggered automatically based on a triggering event, or triggered randomly. A triggering event could be any type of event, including but not limited to, a particular date and/or time, when a particular level of network bandwidth is available, a transition from peak time to non-peak time, or vice versa, based on, for example, historical data or standardized data relating to peak times, or any combination of events, etc. In other embodiments, any of the methods for data confirmation of the present disclosure may be run generally continuously or semi-continuously, for example, as a background process of the data storage subsystem. In some embodiments, as used herein, the terms continuously and semi-continuously may be defined by the typical understanding of those terms as used in the art or defined by well-known dictionaries. For example, the term continuously may be defined as an uninterrupted extension in space, time, or sequence, and the term semi-continuously may be defined as a substantially uninterrupted extension in space, time, or sequence. In other embodiments, the term continuously may refer to embodiments of the present disclosure that are configured to run one or more data confirmation processes, simultaneously, sequentially, or both, over an extended period of time, such as for more than two consecutive hours, and are generally given the ability to consume resources without interruption for substantially the entire period of time. Similarly, in other embodiments, the term semi-continuously may refer to embodiments of the present disclosure that are configured to run one or more data confirmation processes, at least periodically, over an extended period of time, such as for more than two consecutive hours, and are generally given the ability to consume resources for at least more than half the time. Additionally, any of the various embodiments for data confirmation of the present disclosure may be configured so as to run more heavily during periods of relatively decreased system activity and less heavily during periods of relatively increased system activity, so as not to significantly impact or interfere with normal or regular system performance or utilize significant amounts of network bandwidth that could otherwise be used for other system activity. Further, while any of the various embodiments for data confirmation of the present disclosure may be run generally continuously or semi-continuously, it is recognized that the data confirmation processes need not run at a consistent level for the entire period of time they are running continuously or semi-continuously and, indeed, could be periodically halted and restarted at various times for any desired or suitable reason, such as but not limited to, by administrator request or based on system demand. In still further embodiments, any of the various embodiments for data confirmation may be run based on a predefined periodic schedule, such as but not limited to, at a specified time each day or week.

In some embodiments, one or more data reports may be generated from time to time identifying the results of any data confirmation processes run over a specified period of time. Such reports could include a statistical analysis of the validity of data between two or more data sites. In one particular embodiment, for example, a data report could provide a value indicative of the confidence that all, or a specified portion of, data at each data site is valid. The report(s) could be used to analyze whether the data confirmation processes should be modified to increase, or even decrease, if desired, the confidence level, such as by modifying how often the data confirmation processes run, when and for how long the data confirmation processes run, which data the data confirmation processes run on, the hash function or hash seed used, etc. In further embodiments, the data storage system may automatically generate reports and utilize the data obtained from the report to automatically adjust the data confirmation processes in order to achieve a predefined or otherwise specified confidence level.

The various embodiments of the present disclosure relating to systems and methods for confirming data consistency in a data storage environment provide significant advantages over conventional systems and methods for confirming data consistency in a data storage environment, which generally involve resending entire chunks of the underlying data between data sites for comparison. For example, the various embodiments of the present disclosure may consume significantly less network bandwidth in order to confirm the validity of data between data sites, thus saving available bandwidth for other relatively more important, or even less important, system activity, such as but not limited to, initial replication of the data. Although certainly not limited to such systems, the various embodiments of the present disclosure may be particularly useful in systems with relatively slower network links or connections or in systems where there is a significant amount of other communication which should have priority over the available bandwidth.

In addition, repeated hash processes over periods of time, with each repetition including a change in one or more characteristics of the hashing process, as described herein, can increase the confidence that the underlying data is, in fact, identical. The more passes completed and resulting in a positive match of hash values, the more likely the underlying data is, in fact, the same.

Additionally, in many cases, the owner of the data site where the original data is stored and the owner(s) of any data sites where replicated data is sent and stored are not the same. Without direct access to the data stores and supporting hardware and software at the replication sites, the owner of the original data site may, at best, only be able to rely on information provided by the owner(s) of the other data sites regarding the validity of data stored thereat. The owner of the original data site, however, may like to have a method for confirming, through its own system, the reliability of the data stored at the replication sites. Preferably, the owner may like to do so without significant impact on its system's performance. The various embodiments of the present disclosure can fill this need.

In the foregoing description, various embodiments of the present disclosure have been presented for the purpose of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obvious modifications or variations are possible in light of the above teachings. The various embodiments were chosen and described to provide the best illustration of the principles of the disclosure and their practical application, and to enable one of ordinary skill in the art to utilize the various embodiments with various modifications as are suited to the particular use contemplated. All such modifications and variations are within the scope of the present disclosure as determined by the appended claims when interpreted in accordance with the breadth they are fairly, legally, and equitably entitled.

1-20. (canceled)
21. A method for confirming the validity of replicated data at a data storage site, the method comprising: a) replicating first data from a first computer readable storage medium at a first data storage site as second data to a second computer readable storage medium at a second data storage site; b) utilizing a hash function, computing a first hash value based on the first data stored on the first computer readable storage medium at the first data storage site, the first hash value being smaller in size than the first data; c) utilizing the same hash function, computing a second hash value based on the second data stored on the second computer readable storage medium at the second data storage site, the second hash value being smaller in size than the second data; d) transmitting at least one of the first or second hash values via a computer network for comparing with the other of the first or second hash values, in place of retransmitting the larger sized first or second data via the computer network; and e) comparing the first and second hash values, in lieu of comparing the actual first and second data, to determine whether the second data is a valid replication of the first data, wherein a mismatch between the first and second hash values indicates that at least one of the first or second data storage sites includes invalid data.
22. The method of claim 21, wherein the first and second data storage sites are remotely connected by the computer network.
23. The method of claim 21, further comprising providing a data structure storing a plurality of hash functions computable by a computer processor, each being available for use by the first and second data storage sites.
24. The method of claim 23, further comprising selecting the hash function from the data structure storing a plurality of hash functions for utilization in computing the first and second hash values.
25. The method of claim 21, further comprising modifying the first data based on seed data prior to computing the first hash value and modifying the second data based on the seed data prior to computing the second hash value.
26. The method of claim 25, further comprising transmitting the seed data via the network from at least one of the first or second data storage sites to the other of the first or second data storage sites for use by both first and second data storage sites.
27. The method of claim 21, further comprising: utilizing a second hash function, computing a third hash value based on the first data stored on the first computer readable storage medium at the first data storage site; utilizing the second hash function, computing a fourth hash value based on the second data stored on the second computer readable storage medium at the second data storage site; transmitting at least one of the third or fourth hash values via a computer network for comparing with the other of the third or fourth hash values; and comparing the third and fourth hash values, in lieu of comparing the actual first and second data, to determine whether the second data is a valid replication of the first data.
28. The method of claim 25, further comprising: modifying the first and second data based on second seed data; utilizing the hash function, computing a third hash value based on the modified first data; utilizing the second hash function, computing a fourth hash value based on the modified second data; and comparing the third and fourth hash values to determine whether the second data remains a valid replication of the first data.
29. The method of claim 21, further comprising repeating steps b) through d) a plurality of times, each time utilizing a different hash function than for a previous time.
30. The method of claim 29, wherein the steps b) through d) are repeated according to a predetermined periodic cycle.
31. The method of claim 21, further comprising repeating steps b) through d) a plurality of times.
32. The method of claim 31, wherein the steps b) through d) are repeated according to a predetermined periodic cycle.
33. An information handling system comprising: a first data storage site comprising a computer readable storage medium storing first data, and in operable communication with a computer processor computing a first hash value based on the first data, utilizing a hash function; and a second data storage site comprising a computer readable storage medium storing data replicated from the first data storage site, and in operable communication with a computer processor computing a second hash value based on second data, utilizing the same hash function; wherein in lieu of comparison of the first and second data, the computed first and second hash values are compared as an approximation of whether the second data is a valid replication of the first data, wherein a mismatch between the first and second hash values indicates that at least one of the first or second data storage sites includes invalid data.
34. The information handling system of claim 33, wherein the first data storage site and the second data storage site are remotely connected via a computer network.
35. The information handling system of claim 34, wherein the first data storage site is configured to modify the first data based on seed data prior to computing the first hash value and the second data storage site is configured to modify the second data based on the seed data prior to computing the second hash value.
36. The information handling system of claim 33, wherein the computer processor in operable communication with the first data storage site and the computer processor in operable communication with the second data storage site are the same computer processor.
37. The information handling system of claim 36, wherein the computer processor is remote to at least one of the first and second data storage sites.
38. A method for confirming the validity of replicated data at a data storage site, the method comprising: a) replicating first data from a first computer readable storage medium at a first data storage site as second data to a second computer readable storage medium at a second data storage site; b) utilizing a hash function, computing a first hash value based on a selected portion of the first data stored on the first computer readable storage medium at the first data storage site; c) utilizing the same hash function, computing a second hash value based on a selected portion of the second data stored on the second computer readable storage medium at the second data storage site, the selected portion of second data corresponding to the selected portion of first data; d) transmitting at least one of the first or second hash values via a computer network for comparing with the other of the first or second hash values, in place of retransmitting the first or second data via the computer network; e) comparing the first and second hash values, in lieu of comparing the actual first and second data, as an approximation of whether the selected portion of second data is a valid replication of the selected portion of first data; and f) repeating steps b) through e) a plurality of times, each time utilizing a different selected portion of the first data and corresponding selected portion of the second data than for a previous time, wherein during any repetition a mismatch between the first and second hash values indicates that at least one of the first or second data storage sites includes invalid data.
39. The method of claim 38, wherein the first and second data storage sites are remotely connected by the network.
40. The method of claim 39, wherein the steps b) through e) are repeated according to a predetermined periodic cycle, each subsequent repetition in a contiguous chain of repetitions resulting in a match of the first and second hash values increasing the likelihood that the second data is a valid replication of the first data.